This page was intentionally left blank
Coding Replications
For coding replications, whenever applicable, please follow this page or hover on the specific slides with containing coding chunks.
.qmd
format, containing a thorough discussion for all examples that have been showcased. This file, that will be posted on eClass®, can be downloaded and replicated on your side. To do that, download the file, open it up in RStudio, and render the Quarto document using the Render button (shortcut: Ctrl+Shift+K
).A note on Tidy Data
“Tidy datasets are all alike, but every messy dataset is messy in its own way. Tidy datasets provide a standardized way to link the structure of a dataset (its physical layout) with its semantics (its meaning).”
Stock price information is an example of raw data that can be easily pulled from providers such as Yahoo! Finance. However, it is not often structured in a tidy and convenient way
tidyquant
package, which allows us to pull stock price data from multiple securities in a convenient format. You can hit the Download button to get a grasp on how the data looks likedplyr
verbs: recapdplyr
is a grammar of data manipulation, contained in the tidyverse
, providing a consistent set of verbs that help you solve the most common data manipulation challenges:mutate()
adds new variables that are functions of existing variablesselect()
picks variables based on their namesfilter()
picks cases based on their valuessummarise()
reduces multiple values down to a single summaryarrange()
changes the ordering of the rowsmutate()
functionDefinition
The mutate()
function adds new variables that are functions of existing variables:
.data
argument you provided (in your example, the Data
object)…data.frame
(or tibble
)mutate()
mutate()
function, practicehigh
and low
and create a new column, mid
, defined as the average between daily high and low prices. The M7
dataset has been already loaded for you. Even if you submit the wrong answer, a live-tutoring feature will provide you with a handful of tips to adjust your code and resubmit your solution.
First, use the mutate()
function to create the mid
column.
After that, define mid
as (high+low)/2
to calculate the average between the two values.
select()
functionDefinition
The select()
function select (and optionally rename) variables in a data frame, using a concise mini-language that makes it easy to refer to variables based on their name (e.g. a:f
selects all columns from a
on the left to f
on the right) or type (e.g. where(is.numeric)
selects all numeric columns)
.data
argument you provided (in your example, the Data
object)…select(.data,-variable)
to remove a variabledata.frame
intact - no rows are affectedselect()
function, continuedThe select()
function also comes with a handy companion of selectors
, which are functions that help you cherry pick columns in a concise way, rather than hardcoding them altogether:
:
for selecting a range of consecutive variables.starts_with()
starts with a stringends_with()
ends with a stringcontains()
contains a stringmatches()
matches a regular expression.where()
a function to all variables and selects those for which the function returns TRUE
select()
function, practicesymbol
, date
, volume
, and adjusted
, in that order. The M7
dataset has been already loaded for you. Even if you submit the wrong answer, a live-tutoring feature will provide you with a handful of tips to adjust your code and resubmit your solution.
filter()
functionDefinition
The filter()
function is used to subset a data frame, retaining all rows that satisfy your conditions. To be retained, the row must produce a value of TRUE
for all conditions.
filter(.data, #The object which you are performing the operations
variable_1 >10, #Simple arithmetic operators
variable_2 %in% c('AAPL','MSFT','FORD'), #Pattern search
!(variable_3 %in% c('Boston','Mass','Silicon Valley')), #Negate pattern search
variable_4 >=10 & variable_3<= 4 | is.na(variable_4) #IF and OR conditions
)
.data
argument you provided (in your example, the Data
object)…filter()
filter()
function, practiceyear()
function with the date
variable to retrieve the year. The M7
dataset has been already loaded for you. Even if you submit the wrong answer, a live-tutoring feature will provide you with a handful of tips to adjust your code and resubmit your solution.
arrange()
functionDefinition
The arrange()
function reorders the rows of a data frame by the values of selected columns:
#Some Options, always in the following format: the object you are rearranging + the reordering scheme
arrange(.data, variable1) #Ascending by variable_1
arrange(.data, variable1, variable_2) #Ascending by variable_1 and then variable_2
arrange(.data, variable2, variable_1) #Ascending by variable_2 and then variable 1
arrange(.data, variable1, desc(variable_2)) #Ascending by variable_1, and then descending by variable_2
.data
argument you provided (in your example, the Data
object)…data.frame
(or tibble
)lag()
, lead()
, head()
, and tail()
arrange()
function, practicedate
(newest to oldest) and symbol
. The M7
dataset has been already loaded for you. Even if you submit the wrong answer, a live-tutoring feature will provide you with a handful of tips to adjust your code and resubmit your solution.
summarize()
functionDefinition
The summarise()
- or summarize()
- function creates a new data frame. It returns one row for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarising all observations in the input. It will contain one column for each grouping variable and one column for each of the summary statistics that you have specified.
.data
argument you provided (in your example, the Data
object)…data.frame
(or tibble
) by the aggregation functionssummarize()
function, practiceaverage
column, defined as the average adjusted
prices. You can use the mean()
function to get the average. Use the option na.rm=TRUE
inside the mean
function to make sure that NA
values are disregarded. The M7
dataset has been already loaded for you. Even if you submit the wrong answer, a live-tutoring feature will provide you with a handful of tips to adjust your code and resubmit your solution.
group_by()
You saw how the tidyverse
verbs helps us getting ahead of the game when it comes to data operations. In most cases, however, you may need to add an extra layer of complexity: perform operations groupwise:
For cases like this, we need to find a convenient way of repeating the same operation across subsets of our data
It goes without saying that there should be a function in the tidyverse
that makes this operation straightforward: fortunately, you can use group_by()
together with all previous dplyr
verbs!
group_by()
, continued.data
argument you provided (in your example, the Data
object)…avg
variable taking the average of x
within each tuple defined by the grouping variables (in this case, v1
,v2
, and v3
)grouped dataframe
, with the results of avg
displayed for each unique combination of v1
,v2
, and v3
summarize()
call again, but now grouping the data by symbol
firstIn the previous exercises, you have used the main dplyr
verbs to create, select, arrange, filter, and summarize data, one by one. In practical applications, however it is likely that you need more than one of these functionalities at the same time
It is tempting to do it piecewise:
#Start with the data
Data = read.csv('Data.csv')
#Mutate
Data = mutate(Data, new_var_1=var_1*10)
#Select
Data = select(Data, var_1,var_2,new_var_1,where(is.numeric))
#Filter
Data = filter(Data, new_var_1>5)
#Arrange
Data = arrange(Data, new_var_1,desc(var2))
#Summarize
Data = summarize(Data, new_var=mean(new_var_1,na.rm=TRUE))
Data
\(10\) times!%>%
dplyr
verbs, in isolation, are a great tool for data analysts, but what really makes them to shine is what glues them together: I introduce you the pipe (%>%
or |>
)%>%
, continuedRené Magritte was right when he claimed that, in The Treachery of Images (La Trahison des images), there was not a pipe
In fact, the pipe that is relevant for us, R
users, was only introduced recently, in the magrittr
package, which makes clear allusion to the artist
The pipe
operator (%>%
or |>
) helps you chain operations sequentially, in such a way that the output of one operation serves as the input of the subsequent one!
.data
input multiple times, you chain the operations using the pipe operator.data
argument once!)%>%
, continued#Instead of
Data = read.csv('Data.csv') #Start with the data
Data = mutate(Data, new_var_1=var_1*10)#Mutate
Data = select(Data, var_1,var_2,new_var_1,where(is.numeric))#Select
Data = filter(Data, new_var_1>5)#Filter
Data = arrange(Data, new_var_1,desc(var2))#Arrange
Data = summarize(Data, new_var=mean(new_var_1,na.rm=TRUE))#Summarize
#Do
Data = read.csv('Data.csv')%>% #Start with the data
mutate(new_var_1=var_1*10)%>% #Mutate
select(var_1,var_2,new_var_1,where(is.numeric))%>% #Select
filter(new_var_1>5)%>% #Filter
arrange(new_var_1,desc(var2))%>% #Arrange
summarize(new_var=mean(new_var_1,na.rm=TRUE))#Summarize
The pipe
operator lets you pass the object on its left-hand side to the first argument of the function on the right-hand side
Another nice feature in R
is lazy evaluation: function arguments are only evaluated if (and when) they are accessed. This allows us to refer to variables that will only be created within the pipe
without breaking the code!
\(\rightarrow\) Read: Tech stocks slump as China’s DeepSeek stokes fears over AI spending (Financial Times)
As part of your work as a buy-side analyst, you were asked to analyze how the Magnificent 7 performed after the DeepSeek
Follow the instructions and answer to the following question: which stock suffered the most during January 2025?
dplyr
verbs you’ve practiced so fardplyr
functions, like lag()
, prod()
, as.Date()
and drop_na()
data.frame
object that shows, for each symbol
, the monthly return on January, 2025, ordered from lowest-to-highest\(\rightarrow\) Suggested solution will be provided in the replication file for this lecture.
Instructions
The data, stored in M7.csv
, can be loaded using read.csv('M7.csv')
. You can download it using the link shown in Slide 4.
symbol
, date
, and adjusted
columns, and arrange the dataset from oldest to newestdate
variable, making sure to read it as a Date object using as.Date()
Year
variable and filter only on observations happening in 2025. You can use the year()
function to retrieve the year of a given Date
column.symbol
symbol
, a Return
variable that is defined as \(P_{t+1}/P_{t}\), where \(t\) refers to a date. You can use the lag()
function for thislag
produces an NA
whenever you try to lag the first observation. To make sure your data does not contain any NA
, call drop_na()
symbol
, a Cum_Return
variable that is defined as the cumulative return. Compounded returns over time can be written as \(\small \prod(1+R_t)=(1+R_1)\times(1+R_2)\times...\times(1+R_t)\). For this, you can use the prod()
function.symbol
and arrange the table from lowest-to-highest return. The function slice_tail(n=x)
retrieves the bottom x
observations, whereas slice_head(n=y)
retrieves the top y
.#Read the Data
M7%>%
#Select only the columns of interest
select(symbol,date,adjusted)%>%
#Make sure date is read as a Date object
mutate(date=as.Date(date))%>%
#Filter for observations happening in 2025
filter(year(date)==2025)%>%
#Arrange from chronological order
arrange(date)%>%
#Group by Symbol to perform the calculations
group_by(symbol)%>%
#Create the return
mutate(Return = adjusted/lag(adjusted,default = NA))%>%
#Remove NAs before doing the cumulative product
drop_na()%>%
mutate(Cum_Return = cumprod(Return)-1)%>%
#Select the latest observation from each symbol
slice_tail(n=1)%>%
#Select symbol, date, and cumulative return
select(symbol,date,Cum_Return)%>%
#Arrange from lowest-to-highest
arrange(Cum_Return)
# A tibble: 7 × 3
# Groups: symbol [7]
symbol date Cum_Return
<chr> <date> <dbl>
1 NVDA 2025-01-29 -0.106
2 AAPL 2025-01-29 -0.0184
3 TSLA 2025-01-29 0.0259
4 GOOG 2025-01-29 0.0344
5 MSFT 2025-01-29 0.0567
6 AMZN 2025-01-29 0.0765
7 META 2025-01-29 0.129
As you pave your way through the coding exercises, there are a couple of best practices that will make your life easier when dealing with data in an R
session:
Whenever you are loading data, make sure to refer to the correct path where the file is located. You can use the function getwd()
without any arguments to retrieve the current path, and setwd('C:/Users/you/newpath/')
to set up a new working directory
The easiest way to make this logic redundant is to place the .R
(or .qmd
) script in the same folder as the data file (in our case, M7.csv
). When you open the script, it will point to its own directory as the working directory - which will coincide with the data directory
getwd() #Gets the current directory. For Windows users, this is generally defaulted to C:/Users/USER/Documents
setwd('C:/Users/Lucas/Desktop') #changing this to desktop
list.files() #You can confirm if your data is listed in this directory
read.csv('M7.csv') #It works without entering the full path since M7 is in the current path
Ctrl + Enter
R
will run only the selected lines directly into your session, and the output from these lines will be prompted into your sessionCtrl+Shift+K
to render the output altogether